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Efficient Solution of Time-Domain 
Boundary Integral Equations Arising in 
Sound-Hard Scattering 

A. Veit* M. Mortal J. Zapletal^ D. Lukas§ 


Abstract 

We consider the efficient numerical solution of the three-dimensional wave equation 
with Neumann boundary conditions via time-domain boundary integral equations. A 
space-time Galerkin method with C' oc -smooth, compactly supported basis functions 
in time and piecewise polynomial basis functions in space is employed. We discuss 
the structure of the system matrix and its efficient parallel assembly. Different pre¬ 
conditioning strategies for the solution of the arising systems with block Hessenberg 
matrices are proposed and investigated numerically. Furthermore, a C++ imple¬ 
mentation parallelized by OpenMP and MPI in shared and distributed memory, 
respectively, is presented. The code is part of the boundary element library BEM4I. 
Results of numerical experiments including convergence and scalability tests up to 
a thousand cores on a cluster are provided. The presented implementation shows 
good parallel scalability of the system matrix assembly. Moreover, the proposed al¬ 
gebraic preconditioner in combination with the FGMRES solver leads to a significant 
reduction of the computational time. 

AMS subject classifications. 35L05, 65N38, 65R20, 65F08 


1 Introduction 

We are concerned with the efficient numerical solution of time-dependent scattering phe¬ 
nomena in unbounded domains. Specifically, we consider the time-dependent, three- 
dimensional wave equation in the case of a sound-hard scatterer modelled by Neumann 
boundary conditions. We formulate and solve the arising problem in terms of time-domain 
boundary integral equations (TDBIEs). The main advantage of this approach is that the 
problem (originally posed in a three-dimensional unbounded domain) is reduced to the 
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two-dimensional (bounded) surface of the scatterer. Different discretization techniques 
have been introduced and intensively studied to efficiently solve TDBIEs. These methods 
are typically based on the Galerkin discretization in space and utilize either collocation, 
convolution quadrature or a Galerkin scheme for the time discretization. We refer to 
p n ei ttnum and the references therein for an overview of existing methods and to m 
for a detailed introduction to the theory of time-domain boundary integral equations. 

Here, we will employ the space-time Galerkin method to solve the arising equations 
numerically. This approach originated from the groundbreaking work of Bamberger and 
Ha Duong M. They introduced coercive space-time variational formulations for acous¬ 
tically soft and hard scatterers and showed stability and convergence of the resulting 
Galerkin schemes with piecewise polynomial ansatz spaces. The main difficulty in this 
approach is the accurate computation of the matrix entries. If standard piecewise poly¬ 
nomial basis functions are used, special quadrature techniques taking into account the 
complicated shape of the arising 4-dimensional integration domains are necessary. To 
circumvent this problem, we use C^-smooth and compactly supported basis functions 
for the time-discretization. These were introduced in [18, 20J for the Dirichlet problem 
and it was shown that this choice allows an accurate approximation of the matrix entries 
using standard quadrature schemes. As the consequence of the simplified computation of 
the system matrix the use of nonequidistant timesteps and higher order approximation 
spaces in time became feasible [19] . 

In this paper we employ the same type of ansatz functions for the Neumann problem 
using the variational formulation derived in |3]. We focus on equidistant timesteps and 
investigate the implications on the structure of the system matrix. Due to the overlap of 
the temporal basis functions the resulting linear system that needs to be solved admits a 
block Hessenberg structure. We compare a conventional GMRES solver with a GMRES 
solver preconditioned by deflations (see 0 ) and show numerically that the necessary 
number of iterations is significantly reduced. In Section 4.2 we furthermore introduce an 
experimental algebraic preconditioner that exploits the block Hessenberg structure of the 
system matrix. We perform various numerical experiments that show the performance of 
this preconditioner in combination with the FGMRES method developed in [IB] . 

We present an efficient parallel implementation of the above-mentioned discretization 
and solution strategies in Section [5] The code is a part of the boundary element library 
BEM4I [14] . It is based on C++ and uses hybrid parallelization by OpenMP and MPI 
to accelerate the evaluation of the discretized space-time boundary integral operators. 
The implementation leverages the repeating pattern of the system matrices to minimize 
computational cost as well as memory requirements. We briefly describe the structure of 
the BEM4I library and provide details about the parallel system matrix assembly in Sec¬ 
tion 5.1 Results of numerical experiments demonstrating scalability of the computation 


in a distributed memory architecture are provided in Section [6] 


2 Integral Formulation of the Wave Equation 

Let 1? C M 3 be a Lipschitz domain with the boundary denoted by T. We consider the 
homogeneous wave equation 


d^u — Au = 0 in 17 x [0, T] 


( 2 . 11 ) 
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with homogeneous initial conditions 


u(-, 0) = dtu(-, 0) = 0 in Q ( |2.1| b) 

and Neumann boundary conditions 

du 

d n u := — = g on r x [0, T] ( |2d| ') 

on a time interval [0, T] for T > 0, where n denotes the unit outward normal vector. In 
applications, f2 is often the unbounded exterior of a bounded domain. For such problems, 
the method of boundary integral equations is an elegant tool where the partial differential 
equation is transformed to an equation on the bounded surface T. 

We employ an ansatz as a double layer potential for the solution u, 


u(x,t) :=D(j)(x,t ) := — 


1 f ny(x-y) {<i>(y,t - \\x - y\\) 


4vr J r \\x-y\\ 
d t (/>{y,t- ||x - 


-+ 


\ x - y II 


\\x - y II 2 

^ d r y , (. x,t ) € i?\T x [0, T] (2.2) 


with an unknown density function cj). D is also referred to as retarded double layer potential 
due to the retarded time argument t — \\x — y\\ connecting the time and space variables. 

The ansatz (2.2) satisfies the wave equation (2.1 1 ) and the initial conditions (2.1 r). 
Therefore, the unknown density function cj) has to be determined such that the Neumann 
boundary conditions (2.1:) are satisfied. For this the normal derivative of the double 


layer potential has to be extended to the boundary T which can be done continuously 
for sufficiently smooth functions across smooth points of r. We therefore define the 
hypersingular operator 


Wv{x,t) := lim n x 


V x +Dv(x + ,t) 


(2.3) 


for (x,t) 6fx [0, T], where the limit is taken in the sense of distributions. In order to 


find the unknown density function cj) in (2.2) such that (2.1:) is satisfied we thus consider 
the boundary integral equation 


W(j) = g on T x [0, T\. 


(2.4) 


To solve this equation numerically we introduce a weak formulation of (2.4) following [3]. 
A suitable space-time variational formulation is given by: Find (j) hi a Sobolev space V 
such that 


a(4>, C) := 


r i>(y,t- \\x - y\\)C(x,t) 


o Jr Jr l 47 t||x — y\\ 
curl r<j>(y,t- \\x - y ||) • curlrCOM) 


+ 

f, 

/o Jr 


Ar\\x — y || 
g(x, t)C(x, t ) d r x d t =: 6(C) 


d r y dr x d t 


(2.5) 
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for all £ G V, where we denote by </> and cj) the first and second derivatives with respect 
to time. Here, curlp^ is the tangential rotation of the function cj) defined as 

cuAp^x) := n x X V^>(x), 

where cj) is defined in a tubular neighbourhood of T by 

4>{x + en x ) := (j>(x) 

for i£f (see also mm)- 

3 Numerical Discretization 


We discretize the variational problem (2.5) using a Galerkin method in space and time. 


Therefore, we replace the infinite dimensional space V by a finite dimensional subspace 
^Galerkin spanned by L basis functions {6j}^ =1 in time and M basis functions {}y=i i n 
space. This leads to the discrete ansatz 


L M 


^Galerkin (®j t ) EE a{pj(x)bi(t), (x,t) G r x [0,T] , 
i= 1 3 = 1 


(3.1) 


with the unknown coefficients aj. Plugging (3.1) into the variational formulation (2.5) 


and using the basis functions b & and ipi as test functions leads to the linear system 

A QL = g, (3.2) 

where the block matrix A G -^LMxLM^ un k nown coefficient vector a. G and the 
right-hand side vector g G can be partitioned according to 


(3.3) 



'Ai,i 

Ai.2 • 

• A.i,l 


cti 


"gl" 

A : = 

A2,l 

A2,2 • 

■ A 2 ,l 

, a := 

a 2 

. g : = 

g2 


Al, 1 

Al,2 • 

1 

• ^ 

< 


_C*L_ 


g L_ 


with 


A fcii G M MxM , 


a.i G M M , gk G M M for i, k G {1, ■ ■ • , L}. 


Individual entries are given by 

rT 




'o Jr Jr l 47 t||x — y|| 

curl r<Pj(y) ■ cun r <Pi(x 


+ - 


47r||x — y || 


- \\x - y ||) (fi(x)bk(t ) 
bi[t - \\x - y\\)b k (t) \ dr y dr x dt 


(3.4) 


and 


a i(j) = ("Oj-i ’ = J J r d( x ,t) <Pl(x) b k (t) dr x dt, 
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respectively. We rewrite (3.4) by introducing univariate functions 


:= / M* “ r)b k (t)dt 
Jo 

(see [[19] ) and obtain 

A-k,i ii 1 0 


ipk,ii r ) '= r)b k (t)dt 

Jo 


(3.5) 


/ 


Tlx ’ Tin. 


r Jr 


1 Til- V ~T\ wiy) <Pi( x )' t l>k,i(\\ x - 

l 4vr||cc - y|| 


+ 


curl r <Pj(y) • curl r y?;(.T) 
4tt||x - y\\ 


4>k,i(\\x - 


d r y d r x 


(3.6) 


The accurate computation of the matrix entries (3.6) is problematic in the space-time 
Galerkin approach. If piecewise polynomial basis functions in time are used as proposed 
in [3], the integrand in (3.6) is only piecewise smooth which makes standard quadrature 
techniques prohibitively expensive. In |il8j . C^-smooth and compactly supported tem¬ 
poral shape functions were proposed. It could be shown (see hh la ns]) that this 
choice significantly simplifies the accurate approximation of integrals as in (3.6) and as 
a consequence allows the use of nonequidistant stepsizes as well as higher-order approxi¬ 
mation spaces in time. Since our goal in this paper is a fast solver, we restrict ourselves 
to equidistant timesteps as the computational complexity and the memory requirements 
are significantly lower in this case. We denote the timesteps by 


ti := i ■ At, with At : = 


N - 1 


, i = 0,...N- 1, 


where N is the number of timesteps. In the following we briefly recall the definition of 
the temporal basis functions for the special case of equidistant timesteps. We define 


f(t) := 


2 erf (2 artanh t) + \ 


1*1 < 1 . 
t < -1, 
t > 1 


and note that / £ C°° (M). We scale and shift / in order to obtain a (left) cutoff function 

where fa (t) = 


fa ( t ) := / ( 2^ _ i 


At 


0 t <U, 

1 t>t i+ 1. 


We obtain a bump function on the interval [U-i, ii+i] with midpoint ti by 

Pi (*) := 


fa-i (t) U-i <t<U, 
1 - fa (t) U <t< ti+ 1, 
0 otherwise. 


A smooth partition of unity of the interval [0, T] is then defined by 

Pi:=l-fo , PN ■= /iV-2, Vi e {2,...,N-1} : Pi := Pi-1. 

Smooth and compactly supported basis functions in time can now be obtained by mul¬ 
tiplying these partition of unity functions with suitably scaled Legendre polynomials P m 
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of degree m (see (18j for details): 



bi, m ■= 2 - 1 ^ 


bN,m ■= I^N (i)P m ^2 — ^ 2 - 1 ^ 


m = 0,...,p, 

m = 0,... ,p, i = 2,..., N — 1, (3-7) 

m = 0,. .. ,p, 


where p controls the order of the method in time. We will use the above basis functions in 
time for the Galerkin approximation (3.1). Note that the definition is slightly different to 
the one in 


fT8| . Here we simply use p+1 basis functions in the first interval. This choice 
leads to a better asymptotic convergence rate of the method in the interval [to, H] since we 
have the same number of basis functions as for the other time intervals but additionally 
use the a priori knowledge about the solution u(-,0) = dtu(-,0) = 0, which is reflected in 
the factor t 2 in the basis functions (see HI). This choice will simplify the implementation. 
In order to use the functions (3.7) as basis functions in the discrete ansatz (3.1) a suitable 
numeration has to be introduced. Here we use 


bi := b r , i „ » for i = 1,..., L = N ■ p. 

|4l|,modh—l,p+l) ’ ’ ‘ 

For the discretization in space we use standard piecewise polynomial (typically linear) 
basis functions <pj. 


3.1 Efficient representation and evaluation of ipk,i ar| d ipk,-, 


An efficient handling of the functions an d V’fc,! i n (3.5) is crucial for a successful 


implementation of the algorithm since they have to be evaluated numerous times during 
the approximation of the matrix entries ( |3.6[ ). In |19] we propose to approximate this 
type of functions for each k,i £ {1 ,..., L} with piecewise Chebyshev polynomials. This 
approximation is efficient due to the smoothness of 'ipk j and ? ; and it can be evaluated 
efficiently with Clenshaw’s recurrence formula [5]- Here we furthermore exploit the fact 
that we only use constant timesteps. The number of approximations that have to be 
precomputed in this case is only 0(p 2 ) compared to 0(N 2 p 2 ) for variable stepsizes in 
time. 

Let the indices i,k E {1,...,L} be arbitrary but fixed and let i.k £ {l,...,A r }, 
mi, m 2 E {0,... ,p} be such that 


bi = h 


i,mi ’ 


bk = b 


k,m 2 ’ 


(3.8) 


We first consider the case where 2 < i,k < N — 1, i.e., basis functions that are associated 
with “inner” timesteps. Then simple calculus shows that 

<M r ) = f hmS 1 - r )^,m 2 W dt 

& 2 ,mi (t — tj _2 — r)b 2 ,m 2 (t ~ ^ 

= : im 1 ,m 2 ( r + tj-2 ~ ^- 2)1 ( 3 - 9 ) 
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where 


r2At 

£mi,m 2 (oO •— / ^2,mi if G0^2,m 2 (^) dt. 

J 0 


Thus the task of approximating tby for p+2 < k,i < L—p—2 is reduced to approximating 
£mi,m 2 for 0 < mi, m 2 < p on its support [—2Z\i,2Z\f] and evaluating these functions 
according to (3.9). Completely similarly we obtain that 

/•2/H _ 

Vv( r ) = Ui,m 2 (r + - t^_ 2 ), with £ mi , m2 (a) := / 6 2 ,mi(i - Ot)b2,m 2 (t) dt. 

Jo 



Figure 3.1: £ mii m 2 and £m 1 ,m 2 for At = 1. For visualization purposes £ 0,0 and £ 1,0 were multi¬ 
plied by a factor 8. 


Figure 3.1 illustrates the prototype functions £ mi , m2 and £m 1 ,m 2 for 2\t = 1. It suggests 


that these functions are either even or odd depending on m\ and m 2 • Indeed it is easy 
to see that 

£™(«) = (-1 r i+m2+1 | mi>m2 (-a) (3.10) 


due to the symmetry of the basis functions b 2 ,m(t) with respect to t = At. The same 
formula holds for £ mi , m2 . As a consequence, approximations of £mi,m 2 and £mi,m 2 (e.g., 
with piecewise Chebyshev polynomials) have only to be computed and stored for a E 
[0, 2At] since 


$k,i(r) = sign (r + tj_ 2 - t^._ 2 ) mi+m2+1 | mijm2 


(l r + ti-2-*fc- 2 |) > 


(3.11) 


where k,k,m 2 and are related via (3.8). An analogous formula holds for ifk,i- 

Lastly we want to point out a symmetry with respect to m\ and m 2 . Partial integration 
yields 

(mi ,m 2 ( C&0 — ?m 2 ,mi(o) and (mi,m. 2 ( Of) — (m 2 ,mi W' (3.12) 

Therefore, only ^(p + l)(p + 2) functions £ mijTO2 and £ mii m 2 have actually to be approx¬ 
imated. More importantly, this relation has an impact on the structure of the system 


matrix A as we will show in Section 3.2 
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So far we only considered the case 2 < i,k < N — 1. If the basis functions are associated 
with the first or the last timestep, similar prototype functions can be defined. Since this 
is analogous to the procedure described above we omit the details here. 


3.2 Structure of the system matrix A 


The solution of (2.5) using the discrete ansatz (3.1) leads to a linear system with L ■ M 


unknowns. The special choice of basis functions in time as well as the use of equidistant 
timesteps has several implications on the structure of the system matrix A. Throughout 


this section we assume again that k,k,m 2 and i,i,mi are related via (3.8). We denote 


the matrix block Afrom (3.4) by A 1 ?-’™ 1 to highlight the dependence on the timestep 


and the order of the involved basis functions in time. Furthermore, we define the matrix 
block 

r a 0,0 ... 


A ki' 


k,i 


A?’ 0 

k.i 


k.i 


A?’? 

k.i. 


j(p+l)Mx(p+l)M^ 


(3.13) 


We first remark that 

supp'i/’fc.i = supp 1pk,i = [t~ k _ 2 - tj,t k - f-_ 2 ] PI M> 0 , 
where we formally set t-± = 0 and t]\r = T. Thus 

ipk,i = i>k,i = 0 on M for tj, < tj_ 2 . 

This shows that the resulting system matrix A 6 R. LMxLM a( ] m j| s block Hessenberg 
structure 


A = 


' A,,! 

Al,2 0 

0 

0 ■ 

A24 

A2,2 A 2,3 

0 

0 




0 

Aat-^i 

Aat-i^ 


■ ■ ■ AjV— 1,7V 

. Ajv,i 

AjV,2 


A n,n . 


(3.14) 


where 0 6 ]j(p+ 1 ) Mx (p+ 1 ) m denotes the zero matrix. An important consequence of (3.11) 


is that for 2 < i,k < N — 1 the functions 'i[j k % and j only depend on the difference 


tj_ 2 — t k _ 2 and therefore 


A k,i = A k-i+ 2,2 for k ~ ^ 0 and A k,I = A 2 ,3 for k - i = -1. 


(3.15) 


Thus only 0(N ) matrix blocks A^ have actually to be computed and stored. In order 


to further reduce the complexity we remark that for 2 < i,k < N — 1 the formula (3.12) 


together with (3.10) shows that 


k,i ^ k,i 


(3.16) 
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In order to represent A ^ we thus only have to compute and store \(p+l){p + 2) matrix 
blocks A r ™‘~' rn '. Finally we remark that the blocks A'^ l ~’ mi themselves are sparse (see also 

mm) since 

A™-2’ mi (jiO = 0 if supp^j Pi [mindist,maxdist= 0, (3-17) 

where 


mindistyz := min{||a; — y\\, x G suppt/?;, y G supp^} 
maxdistjj := max{||x — y ||, x G supp(^/, y G supp (/?.,■} . 


k.i 


= 0 and therefore 


Note that in the case T > diarn(T), (3.17) implies that 
= 0 if h~2 ~ t i> diam ( r )- 

It is clear that a computational and memory efficient implementation should avoid the 
explicit construction of A. Instead, only those matrix blocks A ~ n ~ ,m 1 should be computed 
that are necessary to recover arbitrary entries of A via (3.15) and (3.16). Furthermore, 
these blocks have to be stored in a suitable sparse matrix storage format. Since the 
solution of the arising linear system (3.2) is typically computed using iterative solvers it 
is also necessary to develop a routine for the efficient multiplication of A with a vector 
that incorporates the special structure of A. 


4 Solution of the linear system 

In contrast to classical schemes using piecewise polynomial temporal basis functions the 
resulting system matrix is not lower triangular with Toeplitz structure and FFT-type 
methods cannot be used for the solution of the linear system. Therefore an efficient 
iterative solver has to be utilized and an appropriate preconditioner of the system has to be 
developed. The GMRES algorithm is one of the possible choices HZ]. However, since the 
system matrix A is global in time and space and therefore of large dimension, the restarted 
version of the algorithm GMRES(m) is usually necessary to keep the computational and 
memory requirements reasonable. To speed up the convergence we present two possible 
preconditioning techniques based on deflation and a recursive algebraic approach. 

4.1 Restarted GMRES preconditioned by deflation 

It is known that restarts of the algorithm lead to a slower convergence than in the case 
of the full-GMRES due to the loss of information on the smallest Ritz values [12]. The 
restarted GMRES preconditioned by deflation (DGMRES(m,/)) aims to keep this infor¬ 
mation by approximating the invariant subspace corresponding to the smallest eigenvalues 
of the system matrix. After each restart the invariant subspace is increased by l new vec¬ 
tors. It has been observed that DGMRES(m, l ) can restore the convergence even in cases 
where the original restarted algorithm stalls |7]. 

Let | Ai| < | A2 1 < ■ • • < \Xlm\ be the eigenvalues of A. The desired preconditioner has 
the form 

M := I LM + U(1/|A lm |T - I r )U T , 
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where T := U 2 AU, I lm, I r are identity matrices of appropriate dimensions, and U is an 
orthonormal basis of the invariant subspace corresponding to the smallest r eigenvalues 
of A. It can be proven that 

M- 1 = I LM + UdAiMlT- 1 - I r )U r 

and the eigenvalues of AM -1 are A r +i, X r +2 , ■.., A lm, \Xlm\- Thus, the smallest r eigen¬ 
values of A are removed and replaced by |Ax,m| with multiplicity r. 

Since it is inefficient to assemble the preconditioner M accurately, we aim to set up its 
suitable approximation M. The GMRES algorithm produces the basis Vfc of the current 
Krylov subspace and the Hessenberg matrix ER = AV^ representing the restriction 
of A onto this subspace. In order to construct M, the Schur decomposition of H^, is 
performed. Using the Schur vectors S corresponding to the smallest eigenvalues of 
we approximate the Schur vectors of A as U ~ V^-S. Moreover, the largest eigenvalue of 
H k is used to approximate the largest eigenvalue of A. 

Using this procedure the algorithm increases U by l vectors after each restart until the 
maximal dimension r of the invariant subspace is achieved. Since the matrices are 
dense and of relatively small dimensions, the Schur decomposition is done efficiently using 
the BLAS routines. Moreover, it is not necessary to explicitly assemble the matrix MU 1 
and its actions are carried out as a sequence of dense matrix-vector multiplications. For 
more detailed description we refer the reader to |JJ. 


4.2 An algebraic preconditioner for block Hessenberg systems 


In this section we propose a preconditioner for the linear system (3.2) that makes use of 
the block Hessenberg structure of A. Let 


AM~ 1 fM • a) = g 


(4.1) 


be the preconditioned system where M denotes the preconditioner. We assume that 


A is partitioned according to (3.14). We choose M to be the matrix that coincides 


with A except that the matrix block A|-jv/2],fJV/2]+l is replaced by a zero matrix of the 
corresponding size, i.e., 


where 


M : = 


Mrp g 

=2,1 M 2 , 2 


M, 1 := Ar T 

-b 1 V fc ’V k,i=l...\N/2] 

M 01 :=(A^)„ 

-2,1 V k=\N/2+l]...N,i:=l...\N/2] 

M 2 ,2 = (^u)fc,i=W/2+ll„.iv' 


This choice of M as a preconditioner in (4.1) is motivated by the Woodbury matrix 


identity which states that the inverse of a rank-fc perturbed matrix can be computed by 
doing a rank-fc correction to the inverse of the original matrix. Since M is a rank -(p +1 )M 
perturbation of A it follows that AM -1 is a rank- (p + 1 )M perturbation of the identity 
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matrix. This suggests that the application of an iterative solver like GMRES is more 


efficient for the preconditioned system (4.1) than for the original one. 

In order to apply the preconditioner within an iterative solver, M -1 is not needed 

explicitly. It is however crucial that the actions of M -1 on a vector r = (r^,r^) T can be 
computed efficiently. Due to the block triangular structure of M it is easy to see that 


M -1 r = 


M ' 1 

= 2,2 




r 2 - 


Therefore, the problem of computing M _1 r boils down to the problem of evaluating 
M~|v and M~*w for some vectors v and w. This is equivalent to the solution of linear 
systems of the form 


Mi i2£ = X 


and 


== 2 , 2 — = —' 


(4.2) 


Note that M and M „ are again block Hessenberg matrices with basically half the 
size of the original matrix A. Thus, these smaller systems can again be solved iteratively 
using a preconditioner of the same form. This strategy can be applied recursively until 
the matrices that have to be inverted are just the diagonal blocks A k,k- On the lowest 
level the resulting linear systems then should be solved either exactly or with a standard 
iterative solver. 


If the inner systems (4.2) and subsequent smaller systems in the recursive process are 


solved very accurately the application of this preconditioner is typically too expensive 
and the resulting solver is too time consuming. Thus we suggest to only employ a limited 


number of iterations of an inner solver to approximate the solution of (4.2). Since the 


classical GMRES algorithm does not allow the preconditioner to change at every iteration 
(and thus forbids the usage of an inexact solver inside the preconditioner) we use a 
flexible inner-outer preconditioned variant of GMRES (FGMRES) presented in ||16] to 
solve both (3.2) and the systems (|4.2|). FGMRES allows a variable preconditioner at the 


cost of double memory requirements but with the same arithmetic complexity. Numerical 
experiments in Section [6] indicate that only a few iterations of the inner FGMRES solver 
during the application of M _1 are necessary to significantly reduce the number of outer 
iterations and the solution time. 

A simplified description of the solver using the recursive preconditioner is listed in Al¬ 
gorithm [l] First, the maximum recursion level of the preconditioner and the number of 
iterations of the inner FGMRES solver are set. Next, the preconditioner is assembled 


and the FGMRES is called to solve the outer system (3.2). During its application the 
preconditioner itself employs iterations of the preconditioned FGMRES to approxi¬ 


mate the solution of the systems (4.2). This repeats recursively until the maximum level 


of recursion is reached. On the lowest level the inner systems are solved with FGMRES 
without a preconditioner. 

The application of this recursive preconditioner is experimental. Although the numer¬ 
ical benchmarks in Section [6] show promising results a rigorous convergence analysis is 
necessary. Moreover, only a sequential version has been tested and its proper paralleliza¬ 
tion is yet to be performed. 
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Algorithm 1 Solution of the system using the recursive preconditioner 
Input: System matrix A, RHS vector g, relative precision e, max. number of iterations m; t 
Set max. recursion depth of the preconditioner m r 
Set max. number of inner iterations 
Initialize the current recursion level r := 1 
AssemblePreconditioner( A, M) 

FGMRES(A, a, g, e, m#, M) 

Output: Solution vector a 


function AssemblePreconditioner(A, M) 

> assembles the preconditioner M for the block Hessenberg matrix A 


Mi,i 

= (A fcli 

kji=X,..., \N/2~\ 

M 2j i 

= (Afc,i 

fc—[1V/2+1]_ ,N,i 

M 2j2 

= (A*,* 

k,i=\N/2+Y\,...,N 

M := 

/Ml,! 

° 1 

\M 2 ,i 

m 2|2 J 


function ApplyPreconditioner(M, x, y) 

> applies the preconditioner M on a vector x and stores the result in y 
Mi.i := (Mfc,i)fe,i=i,... i fjv/2i 

M 2 ,l := (Mk ! i)fc = fJV/2+l],...,jV,i=l,...pV/2] 

M 2 , 2 := (Mfc i i)fc i i = pjv/a+i],...,JV 
x = [xi, x 2 ] T , y = [yi.yap 

if r < m r then 

AssemblePreconditioner(Mi,i, Mi) 
AssemblePreconditioner(M2, 2 , M 2 ) 

else 

Mi = M 2 := I 
r := r + 1 

FGMRES(M U , yi, xi, e , mj?, Mi) 

FGMRES(M 2 .2, y 2 , X 2 - M 2i iyi, e , m \£, M 2 ) 
return y = [yi,y 2 ] T 


function FGMRES(A, x, y, e, m#, M) 

> solves the system Ax = y using FGMRES algorithm with relative precision e, 
maximum number of iterations rriit, and preconditioner M 


5 Implementation 


A parallel solver for sound scattering problems based on the approach described in the 
previous sections has been implemented in the BEM4I library [2j. The code is written 
in C++ in an object oriented way and utilizes OpenMP and MPI for parallelization in 
shared and distributed memory, respectively. All classes are templated to support various 
indexing and scalar types. 

The structure of the solver is depicted in Figure 5.1 Its core consists of a set of classes 
responsible for the assembly of system matrices. 


1. The BESpace class holds the information about the order of spatial and temporal 
basis and test functions. It also stores the object of the underlying surface mesh. 

2. The purpose of the BEBilinearFormWave class is to assemble appropriate system 
matrices. Several matrix types are supported, including non-distributed sparse ma- 
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BESpace 


Back-end 


Front-end 


ScatteringProblem 


BEBilinearFormWave 


BEIntegratorWave 


SurfaceMesh, SparseMatrix, 
MPIBIockMatrix, IterativeSolver, ... 


setlncidentWave(); 

setScattererMesh(); 

setSolver(); 

setOutputFile(); 

solveQ; 



r 



F 

'araViev 

.vtu 


L 


J 


Figure 5.1: Structure of the solver for wave scattering problems in the BEMfl library. 


trices (suitable for small problems solved by a direct solver), non-distributed block 
sparse matrices, and block sparse matrices distributed among computational nodes 
using MPI. The usage of block system matrices reduces memory and computational 
requirements since it does not duplicate assembly and storage of identical blocks. 
The assembly of individual blocks is parallelized by OpenMP. 

3. The BEIntegratorWave class is responsible for the assembly of local system matri¬ 
ces, i.e., the quadrature over pairs of elements. It takes care of evaluation of smooth 
temporal basis and testing functions and their Chebyshev approximations. 

Besides these main classes the solver utilizes classes representing surface meshes, full, 
sparse, and block matrices, direct and iterative solvers (including GMRES, DGMRES, 
and FGMRES), classes for the evaluation of potential operators, etc. A user solves a 
problem either by direct manipulation of the above-mentioned classes or using the inter¬ 
face provided in the ScatteringProblem class. The solution is exported to the ParaView 
.vtu file format. 


5.1 Assembly of distributed system matrix 


Special care has to be taken during the parallel assembly of the system matrix to exploit 
its structure and minimize memory and computational requirements. One has to take into 
account its properties described in Section 3.2, i.e., the block Hessenberg format (3.14), 


duplication of blocks following from (3.15), sparsity and the special structure of individual 


blocks (3.16), as well as the fact that A = 0 for tj,_ 2 > diam(T). In the following 

we describe the assembly of a distributed block system matrix. Similar principles can 
be applied to assemble a non-distributed block matrix suitable for problems of smaller 
dimension. 

To represent a system matrix we use the MPIBIockMatrix class of the BEM4I library. 
The class serves as a simple distributed container for non-distributed matrices, such as 
FullMatrix, SparseMatrix, or any linear operator implementing the Apply method on 
a vector. When an instance of the class is created in the MPI environment, each MPI 
process is only assigned a subset of matrix blocks. The matrix-vector multiplication is 
performed in parallel and the local results are gathered, summed, and distributed among 


all processes. The multiplied vector is replicated on every process (see Figure 5.2) 
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Figure 5.2: Multiplication by a distributed matrix MPIBlockMatrix. 


In the first step of the matrix assembly, the distribution of non-zero blocks -■ among 
P available MPI processes is determined. Assuming as above that the time interval [0, T] 
is discretized into equidistant tinresteps of the length At := T/(7V — 1), the number of 
inner non-zero blocks Ar -,k,i £ {2, ...,7V — 1} can be obtained as 


\ _/y _ fi _ 2 

n z := -(TV 2 - N - 4)- -A -max{7V - n z - 1, 0}, 


where 


n 2 := min < TV. 


2 • diam(T) 

At 


+ 2 


(5.1) 


(5.2) 


is the number of uon-zero blocks in the first matrix column. Each MPI process is assigned 
approximately n z /P inner non-zero blocks. The exact distribution of blocks among pro¬ 
cesses is defined by a mapping R : N x N —> No assigning the MPI rank of the owner to 
every block. Two factors have to be considered when defining R - proper load balance 
during the matrix-vector multiplication and duplication of blocks Aj. 01 k £ {2,...,A — 2} 


and A 2 t 3 due to the relation (3.15). Therefore, the distribution is performed diagonal-wise 
in order to store identical blocks on the same process in one memory location. However, 
some blocks are duplicated among several processes to ensure balanced load during the 
matrix-vector multiplication. An example of the distribution of inner blocks among pro¬ 
cesses is depicted in Figure |5.3| After the assignment of inner blocks, the blocks in the 
first and last rows and columns are mapped to processes with the smallest number of 
blocks assigned. Note that this distribution does not take into account differences in 
fill-in of various blocks. 

After the mapping R is defined, the blocks Ar^ are assembled. Due to the relation 
( |3. 15 ) the total number of blocks to be assembled is at most 3 N (the first two columns, 
the last row, and the blocks A 2,3 and Ajv-i,tv), therefore the complexity is reduced from 
0(N 2 ) to O(N). Each unique non-zero block is assembled and distributed in the following 
way: 

1. An instance of the SparseMatrix class representing the current block A^. j is created. 

2. Approximations of temporal basis functions associated with the current block are 


computed using the relations from |19| and Section 3.1 


3. Using (3.17) the non-zero pattern of the current block A^i is precomputed together 
with pairs of mesh elements contributing to the non-zero entries. This computation 
is parallelized by MPI and OpenMP. 
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Figure 5.3: Example of a system matrix and distribution of its inner blocks A 2 ) 2 ,..., Ajv-i,jv-i 
among five processes with ranks {0,1,... ,4}. Identical blocks are connected by a line (here N = 
8 ,n z = 3). 


4. 

5. 


The pairs of elements are evenly split among available MPI processes. 

Every MPI process iterates through its element pairs and through the order of the 
temporal basis functions and computes its local contribution to the block Aj, -■ using 
(3.6). In order to further save computational time and memory property (3.16) can 
be employed. The iteration through element pairs is parallelized in shared memory 
by OpenMP. 


6. Each process sends its partial results to the owner(s) of the block determined by 
the mapping R. The owner combines the results to form the current block. If any 
identical block is owned by the process, only a reference to the assembled block is 
copied to the appropriate memory location and the data is not duplicated. 


7. All other processes delete their partial results. 


6 Numerical experiments 

In the following section we present results of numerical experiments performed on the 
Anselm cluster located at the IT4Innovations National Supercomputing Centre, Czech 
Republic. The cluster consists of 188 compute nodes, each of them equipped with two 8- 
core Intel Xeon E5-2665 processors and 64 GB of RAM. The theoretical peak performance 
of the cluster is 82 Tflop/s. 


6.1 Convergence tests 


In this section we test the convergence of the space-time Galerkin method introduced 
above. In m, analytic solutions of direct and indirect formulations for Dirichlet and 
Neumann problems on the unit sphere were derived. Such solutions turned out to be 
very useful for numerical experiments and the validation of the algorithm and also serve 


here as reference solutions. We recall these solutions for the problem (2.4) for the sake of 
completeness. 

Let us assume that the scatterer is the unit sphere, i.e., T = § 2 , and that the right-hand 


side in (2.4) is of the form 


9 {x,t) = c?(t)T n m (x) for (x,t) E S 2 X [0,T], 


( 6 . 1 ) 
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where Y™'{x) are spherical harmonics of degree n and order m. We assume g(t ) to be 
causal, i.e., g(t) = 0 for t < 0 and that g(0) = 0. It can be shown (see [15]) that Y™(x) 
are eigenfunctions of the hypersingular operator in the frequency domain, i.e., 


C(W)(s)Y™(x) = \n(s)Y™(x), 

where £ denotes the Laplace transform. The function A n (s) can be expressed explicitely 
in terms of spherical Bessel functions and spherical Hankel functions of the first kind. 
Assuming that the solution of (2.4) for the special right-hand side (6.1) also admits the 
form (f)(t)Y™(x) and exploiting the fact that IT is a convolution with respect to the time 
variable we obtain 


d(j)C 1 j-!- j [t - t) dr. 


( 6 . 2 ) 


Evaluating the inverse Laplace transform in the formula above leads to explicit formulae 
for cj). In the case n = 0, i.e., the right-hand side g(x,t ) = g(t) is purely time-dependent, 
the solution of (2.4) is purely time-dependent as well and is given by 


(j)(x,t) = — 2 / g(t — r) cosh(r) dr 

Jo 

L/2J k 


+ 2^^(-l) fc+1 / c ktl (T-2k) 

k =1 1=1 j2k 


k—l+1 e r— 2k 


g(t - r) dr, 


where 


Ck,l ■ = 


k- 1 


~)k—l 


l-l) {k-l + l)V 

In the case n = 1, i.e., the right-hand side is of the form g(x,t) = < 7 (t)T 1 m (x) the solution 
of (2.4) is given by 

cj)(x, t) = ^—2 J g(t — r) cosh(r) cos(r) dr^j Y™{x) 

for t G [0, 2[. Formulae for large n and t are typically complicated and furthermore difficult 
to evaluate due to numerical cancellation. In this case the one-dimensional problem 


£ 1 {A n } (t - r)</>(r)dr = g(t) 


can be accurately and efficiently solved using the convolution quadrature in order to 
obtain approximations of (6.2) (see, e.g., HD- 


In the following set of numerical experiments we test the convergence of the method 


using the reference solutions introduced above. First, we solve (2.4) on the unit sphere 


§ 2 for t G [0, 6] with the purely time-dependent right-hand side of the form 

g(x,t ) := sin(3t)t 2 e _t Y®(x). 

Note that Y$ is constant on the unit sphere. The error of the numerical solution is 
measured in the L 2 (T; [0,6]) norm. Using the local (temporal) polynomial approxima¬ 
tion space of order p = 1 we obtain the convergence rate of A -1 (see Figure 6.1); the 


convergence for p = 2 is depicted in Figure 6.2 
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Figure 6.1: Left: log-log scale plot of ||^-^Gai e rkin||L2(r ; [o,t]) for g(t) = sin(3 t)t 2 e \n = 0,T = 
6, and local polynomial approximation space of degree p = 1. Right: numerical solution in x £ T 
obtained using 20 timesteps and p = 1. 


For the second numerical experiment we consider the right-hand side 

g(x,t) := sin(27rt)t 3 e~ 2t Y°(x), 


and solve (2.4) with r := § 2 and t € [0, 2], The numerical solution evaluated in the point 
r 3 x = ( 0,0,1) obtained using the temporal approximation space of order p = 2 and its 
convergence to the analytical solution in the L 2 (T; [0, 2]) norm are depicted in Figure 6.3 


6.2 Convergence of iterative solvers 

We demonstrate the performance of the iterative solvers introduced in Section[4]on a set of 
numerical experiments. The tests were performed using a mesh with 320 surface elements, 
the relative precision of the solvers is set to e := ICE 5 . The results for the local polyno¬ 
mial approximation space of degree p = 1 are shown in Tables [l] and [2] in Tables [3] and 
[4] we provide results for p = 2. The convergence history of all solvers is depicted in Fig¬ 
ure [6T4] The arguments m and l in GMRES(m) and DGMRES(m, l) denote the number 
of iterations between restarts and the number of vectors added to the basis of the invari¬ 
ant subspace after each restart, respectively. Moreover, by FGMRES(m, r(ii,..., i r )) we 
denote the solution by flexible GMRES preconditioned as in Section [4.2| with r levels of 
recursion and i\,... ,i r iterations of the inner FGMRES solver on each level. 

The experiments demonstrate that both preconditioners presented in Section [I] lead 
to a significant reduction of iterations and computational time. Moreover, the results 
indicate that only a few iterations of the inner solver are necessary during the application 
of the preconditioner in the case of FGMRES. The best convergence was obtained when 


the solution of (4.2) was approximated by two iterations of FGMRES on the first level 


and ten iterations on the second level of recursion. 


17 













Figure 6.2: Left: log-log scale plot of ||^-^Gai e rkin||L2(r; [o,t]) for g(t) = sin(3 t)t 2 e \n = 0,T = 
6, and local polynomial approximation space of degree p = 2. Right: numerical solution in x £ T 
obtained using 20 timesteps and p = 2. 


GMRES(50) DGMRES(50,4) 

N ff iterations time [s] ff iterations time [s] 


5 

595 

1.6 

246 

0.8 

10 

2121 

14.3 

421 

3.3 

15 

4021 

44.5 

580 

8.5 

20 

5448 

99.0 

510 

14.9 


Table 1: Convergence of GMRES and DGMRES for p = l. 


6.3 Wave scattering off a submarine 

The following numerical experiments serve to demonstrate the behavior of the solver on 


more complex geometries, such as the submarine-shaped scatterer depicted in Figure 6.5 


We solve the sound-hard scattering problem for t £ [0, 6] with the planar incident wave 

mm 

inc / _ / A COS (kx + (po ~ COt), ut — Ulf > kx > Ujt — Ult, 

U a "’ \ 0, otherwise, 

where A = 0.02, k = (— n/y/2, 0.0, — 7t/\/2), uj = n, rrif = 67T, and m* = 8ir. The 
Neumann boundary condition (2. l{c ) is given by 


Ifn 0 — 


du 11 


dn 


__ ( Asin(A:x + (fio ~ ut)kn , ujt — rrif > kx > u>t — mt, 
0, otherwise, 


with n being the unit outward normal vector to T. We decompose the boundary of the 
scatterer into 5604 surface elements, the time interval into 40 equidistant time steps and 
use the local polynomial approximation space of order p = 1. 


On the left-hand side of Figure 6.6 the scalability of the system matrix assembly up to 
1024 cores is depicted. The assembly time was reduced from 5702 s on 16 cores to 217 s on 


18 














10' 1 



5 10 20 40 

Number of timesteps 



t 


Figure 6.3: Left: log-log scale plot of ||<£ - ^Gaierkin||z,2(r ; [o,r]) for g(t) = sin(27ri)f 3 e 2t ,n = 
1, T = 2, and local polynomial approximation space of degree p = 2. Right: numerical solution in 
x £ r obtained using 20 timesteps and p = 2. 


FGMRES(50,1(10)) FGMRES(50,1(5)) FGMRES(50,2(2,10)) 
N ff iterations time [s] ff iterations time [s] ff iterations time [s] 


5 

24 

0.7 

45 

0.9 

23 

0.8 

10 

43 

3.1 

126 

6.8 

26 

3.3 

15 

51 

7.3 

205 

20.0 

28 

5.9 

20 

48 

9.7 

341 

51.2 

34 

10.6 


Table 2: Convergence of FGMRES with recursive preconditioner for p = 1. 


1024 cores. The corresponding parallel efficiency with respect to a single computational 
node is depicted in the right-hand side of Figure [6~6l The main bottleneck influencing the 
scalability is currently caused by the MPI communication during the non-zero pattern 
computation and during the gathering of matrix contributions from MPI processes. 

Finally, Figure 6.7 depicts the solution (sum of incident and scattered wave) in the 
vicinity of the scatterer at time t = 3.5 s. The double layer potential was evaluated in 
66049 points for 40 timesteps. The evaluation took 327 s using 64 computational nodes. 


7 Conclusion 

We considered the numerical modelling of time-dependent, three-dimensional sound-hard 
scattering phenomena in unbounded domains. We used time-domain boundary integral 
equations in combination with space-time Galerkin methods to solve the arising problems 
numerically. The use of C^-smooth and compactly supported temporal basis functions 
simplifies the generation of the system matrix and as a consequence allows the use of 
higher order approximation schemes in time which significantly improves the accuracy of 
the method. Due to the overlap of the basis functions in time the arising linear systems 
that need to be solved admit a block Hessenberg structure. We examined the performance 
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GMRES(50) DGMRES(50,2) 

N ff iterations time [s] ff iterations time [s] 


5 1985 

14.3 

1434 

11.6 

10 5404 

89.1 

3834 

67.9 

15 4634 

132.7 

1491 

52.1 

20 6383 

293.3 

1269 

68.1 

Table 3: Convergence 

of GMRES and DGMRES for p = 2 


N 

FGMRES(50,1(10)) 
ff iterations time [s] 

FGMRES(50,1(5)) 
ff iterations time [s] 

FGMRES(50,2(2,10)) 
ff iterations time [s] 

5 

83 

5.5 

210 

8.9 

— 

— 

10 

257 

44.3 

627 

77.3 

94 

26.6 

15 

148 

48.3 

363 

82.2 

56 

24.8 

20 

91 

45.9 

399 

139.4 

63 

40.4 


Table 4: Convergence of FGMRES with recursive preconditioner for p = 2. 


of a restarted GMRES solver preconditioned by deflation and proposed an algebraic pre¬ 
conditioner that exploits the block Hessenberg structure of the matrix. Various examples 
show that both solvers require a considerably lower number of iterations than a conven¬ 
tional GMRES method. Especially the second approach is promising since the overall 
solution time could be significantly reduced. A rigorous analysis and a possible parallel 
implementation of the preconditioner, however, will be considered in the future. 

We furthermore presented a parallel implementation of the scheme that is based on 
the boundary element library BEM4I. It exploits the special structure of the system 
matrix to reduce computational complexity and memory requirements. It uses OpenMP 
and MPI for parallelization in shared and distributed memory. Numerical experiments 
show good parallel scalability and efficiency of the computations in a distributed memory 
architecture with up to 1024 cores. 
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Figure 6.4: Convergence history of GMRES, DGMRES, and FGMRES with recursive precondi¬ 
tioner (p = 1,N = 20). 



Figure 6.5: Submarine-shaped computational geometry. 




Optimal scalability —Assembly time ♦ Optimal efficiency Parallel efficiency 


Figure 6.6: Parallel scalability and efficiency of the system matrix assembly up to 1024 cores 
(64 nodes). 
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Figure 6.7: Solution (sum of incident and scattered wave) at time t = 3.5s. 
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